feat(arrow): expose Arrow IPC reader via registerArrow and readArrow #52
Open
LantaoJin wants to merge 1 commit into
Conversation
Closes apache#37

Mirrors the existing parquet/csv/json reader pattern for Arrow IPC files. Adds:

- proto/arrow_read_options.proto with the ArrowReadOptionsProto message (file_extension only; an explicit Arrow schema rides on the existing schema-IPC byte channel rather than this proto, matching the other formats)
- ArrowReadOptions Java builder with fileExtension default ".arrow" and schema(Schema)
- SessionContext.registerArrow(name, path[, options]) and readArrow(path[, options]) overloads with null-argument validation per Andy's apache#47 review feedback
- native/src/arrow.rs JNI module that decodes the proto and dispatches to upstream SessionContext::register_arrow / read_arrow

Note on ArrowReadOptions construction: upstream's ArrowReadOptions exposes file_extension as a public field (not a builder setter), unlike the other format options. The native side uses struct-update syntax to set it without tripping clippy's field_reassign_with_default lint.

Tests cover proto round-trip, schema-by-reference, register/read on a fixture written by arrow-vector's ArrowFileWriter (the canonical Arrow IPC file format DataFusion's source supports), custom file extension, explicit Arrow schema, and null-argument rejection on both register and read.

Out of scope: tablePartitionCols (no parquet/csv/json analog on the Java side yet). Arrow IPC carries body compression inside the file format itself, so unlike CSV and NDJSON there is no FileCompressionType on this options class.
Which issue does this PR close?

Closes #37.
Rationale for this change
DataFusion 53.x supports Arrow IPC files via
SessionContext::register_arrow / read_arrow, but the Java bindings only expose Parquet, CSV, and (in #47) NDJSON. Since JVM results already come back as Arrow batches via the C Data Interface, an Arrow IPC reader on the Java side closes the natural round-trip: Java callers can write Arrow IPC to disk with arrow-vector's ArrowFileWriter, then read it back through DataFusion without going through Parquet or any other intermediate format. Today they have to fall back to CREATE EXTERNAL TABLE … STORED AS ARROW via SQL, which works but bypasses the typed builder pattern.

This PR is the Java surface for the existing upstream functionality. Issue #37 tracks it; the implementation follows the same proto-over-JNI pattern as #47 (NDJSON), #29 (the CSV/Parquet refactor), and the merged CSV/Parquet readers.
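The typed builder pattern mentioned above can be sketched as follows. This is a standalone mock of the shape the PR describes (default ".arrow", a fileExtension setter), not the actual bindings code; the schema(Schema) setter is omitted so the sketch stays self-contained.

```java
// Standalone sketch of the ArrowReadOptions builder shape described in this PR.
// Only the fileExtension handling is mirrored here (default ".arrow"); the real
// class also carries a schema(Schema) setter backed by Arrow's Schema type.
final class ArrowReadOptions {
    private final String fileExtension;

    private ArrowReadOptions(Builder builder) {
        this.fileExtension = builder.fileExtension;
    }

    String fileExtension() {
        return fileExtension;
    }

    static Builder builder() {
        return new Builder();
    }

    static final class Builder {
        private String fileExtension = ".arrow"; // default per the PR description

        Builder fileExtension(String fileExtension) {
            this.fileExtension = fileExtension;
            return this;
        }

        ArrowReadOptions build() {
            return new ArrowReadOptions(this);
        }
    }
}
```

A caller that never touches the builder gets the ".arrow" default; setting a custom extension is a single chained call before build().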
What changes are included in this PR?
- proto/arrow_read_options.proto: new ArrowReadOptionsProto message. Single field: file_extension (default .arrow). An explicit Arrow schema rides on the existing IPC byte channel through the JNI layer, mirroring the parquet/csv/json paths, and is therefore not encoded in this message. No FileCompressionType field: Arrow IPC files carry body compression (LZ4_FRAME / ZSTD per-buffer) inside the file format itself.
- ArrowReadOptions Java builder with fileExtension(String) and schema(Schema) setters.
- SessionContext.registerArrow(name, path[, options]) and readArrow(path[, options]) overloads, structurally identical to the parquet/csv/json entry points.
- native/src/arrow.rs: JNI module that decodes ArrowReadOptionsProto, constructs the upstream ArrowReadOptions, and forwards to register_arrow / read_arrow. Imports ArrowReadOptions from datafusion::execution::options rather than prelude (it's not re-exported there, same situation as JsonReadOptions).

Out of scope (for follow-ups):
- tablePartitionCols: neither parquet, csv, nor ndjson exposes Hive-style partitioning on the Java side yet. Adding it for Arrow only would diverge.

Are these changes tested?
Yes, 9 new tests across ArrowReadOptionsTest and SessionContextArrowTest.

Are there any user-facing changes?
Yes, purely additive. New public API:

- org.apache.datafusion.ArrowReadOptions
- SessionContext.registerArrow(String, String)
- SessionContext.registerArrow(String, String, ArrowReadOptions)
- SessionContext.readArrow(String) → DataFrame
- SessionContext.readArrow(String, ArrowReadOptions) → DataFrame

The new org.apache.datafusion.protobuf.ArrowReadOptionsProto generated class is also exposed via the protobuf-Java output, consistent with how CsvReadOptionsProto, NdJsonReadOptionsProto, and ParquetReadOptionsProto are exposed. No API removals, no deprecations, no behavior change for existing callers.
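The overload-plus-validation pattern the new entry points follow can be sketched as below. The real SessionContext dispatches over JNI, so this standalone mock only illustrates the delegation from the short overload to the full one and the up-front null checks; the method names come from the PR, while the body is a hypothetical stand-in.

```java
import java.util.Objects;

// Standalone sketch of the registerArrow entry-point pattern: the two-argument
// overload delegates to the three-argument one with default options, and the
// full overload validates its arguments up front (per the #47 review feedback).
// The JNI dispatch is replaced by a comment so the sketch is self-contained.
class SessionContextSketch {
    void registerArrow(String name, String path) {
        // Stand-in object for a default-constructed ArrowReadOptions.
        registerArrow(name, path, new Object());
    }

    void registerArrow(String name, String path, Object options) {
        Objects.requireNonNull(name, "name must not be null");
        Objects.requireNonNull(path, "path must not be null");
        Objects.requireNonNull(options, "options must not be null");
        // Real implementation: serialize options to ArrowReadOptionsProto bytes
        // and invoke the native register method over JNI.
    }
}
```

Failing fast with NullPointerException on a null table name or path keeps the error on the Java side instead of surfacing it as an opaque native-layer failure.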